Sunah Park, 3. December 2018


Executive Summary

This document is submitted as a final project of Linear Regression and Modeling course as a part of Coursera Statistics with R specialiation. The project utilizes multiple linear regression to analyse the provided data.

Research question:

  1. Is there linear relationship between the scores given by the audience and runtime of movie or the fact whether or not the movie director has ever won an Oscar?

  2. How can we predict the audience score regarding these two variables?


Part 1: Data

The data set is comprised of 651 randomly sampled movies produced and released before 2016.

Inference generalizability and causality:
The random sampling used for the data collection allows us to generalize the results of the analysis to the population. However, as it is an observational study with no random assignment, we cannot assure the causality between variables.

More Info:
Rotten Tomatoes
IMDB


Part 2: Research question

  1. Is there linear relationship between the scores given by the audience and runtime of movie or the fact whether or not the movie director has ever won an Oscar?

  2. How can we predict the audience score regarding these two variables?

The response variable for this analysis is the audience score(audience_score) and the two explanatory variables are the runtime of movie(runtime) and the fact whether or not the movie director ever won an Oscar(best_dir_win).


Part 3: Exploratory data analysis

Setup

rm(list=ls())
# default r markdown global options in document
knitr::opts_chunk$set(
   ########## Text results ##########
    echo=TRUE, 
    warning=FALSE, # to preserve warnings in the output 
    error=FALSE, # to preserve errors in the output
    message=FALSE, # to preserve messages
    strip.white=TRUE, # to remove the white lines in the beginning or end of a source chunk in the output 

    ########## Cache ##########
    cache=TRUE,
   
    ########## Plots ##########
    fig.path="", # prefix to be used for figure filenames
    fig.width=8,
    fig.height=6,
    dpi=200
)

Load Libraries

library(tidyverse) # load tidyverse library
library(ggplot2)
library(statsr) # load statsr library
#library(gplots) # load gplots library
library(knitr)

Load Data

knitr::opts_chunk$set(echo = TRUE,cache=TRUE)
setwd("~/Desktop/Coursera/Statistics_with_R/labs_Sunah/3.4_Assignment") # set the working directory
load("movies.Rdata")
df<-as.data.frame(movies) # assign brfss2013 as data frame with the name 'df'

Quick view of the loaded data

dim(df)
## [1] 651  32
head(df,2) # first 2 rows of the data frame
##         title   title_type genre runtime mpaa_rating                studio
## 1 Filly Brown Feature Film Drama      80           R   Indomina Media Inc.
## 2    The Dish Feature Film Drama     101       PG-13 Warner Bros. Pictures
##   thtr_rel_year thtr_rel_month thtr_rel_day dvd_rel_year dvd_rel_month
## 1          2013              4           19         2013             7
## 2          2001              3           14         2001             8
##   dvd_rel_day imdb_rating imdb_num_votes  critics_rating critics_score
## 1          30         5.5            899          Rotten            45
## 2          28         7.3          12285 Certified Fresh            96
##   audience_rating audience_score best_pic_nom best_pic_win best_actor_win
## 1         Upright             73           no           no             no
## 2         Upright             81           no           no             no
##   best_actress_win best_dir_win top200_box         director         actor1
## 1               no           no         no Michael D. Olmos Gina Rodriguez
## 2               no           no         no        Rob Sitch      Sam Neill
##             actor2               actor3        actor4              actor5
## 1     Jenni Rivera Lou Diamond Phillips Emilio Rivera Joseph Julian Soria
## 2 Kevin Harrington    Patrick Warburton      Tom Long      Genevieve Mooy
##                               imdb_url
## 1 http://www.imdb.com/title/tt1869425/
## 2 http://www.imdb.com/title/tt0205873/
##                                         rt_url
## 1 //www.rottentomatoes.com/m/filly_brown_2012/
## 2             //www.rottentomatoes.com/m/dish/
colnames(df)
##  [1] "title"            "title_type"       "genre"           
##  [4] "runtime"          "mpaa_rating"      "studio"          
##  [7] "thtr_rel_year"    "thtr_rel_month"   "thtr_rel_day"    
## [10] "dvd_rel_year"     "dvd_rel_month"    "dvd_rel_day"     
## [13] "imdb_rating"      "imdb_num_votes"   "critics_rating"  
## [16] "critics_score"    "audience_rating"  "audience_score"  
## [19] "best_pic_nom"     "best_pic_win"     "best_actor_win"  
## [22] "best_actress_win" "best_dir_win"     "top200_box"      
## [25] "director"         "actor1"           "actor2"          
## [28] "actor3"           "actor4"           "actor5"          
## [31] "imdb_url"         "rt_url"

Visualization

df<-df %>%
    select(audience_score, director, runtime, best_dir_win) %>%
    arrange(director) %>%
    na.omit

ggplot(data=df, aes(runtime))+geom_histogram(aes(runtime),alpha=0.4, bins=30)+theme_bw(base_family="Avenir")+labs(x="Runtime in minutes", y="Count", title="")+guides(color=guide_legend(title="Director ever won an Oscar"))+theme(legend.position="bottom")+geom_vline(xintercept=mean(df$runtime), linetype="dotted",color="black")

ggplot(data=df, aes(x=runtime, y=audience_score))+geom_jitter(alpha=.5, aes(color=best_dir_win))+theme_bw(base_family="Avenir")+labs(x="Runtime in minutes", y="Audience score", title="")+guides(color=guide_legend(title="Director ever won an Oscar"))+theme(legend.position="bottom")+geom_smooth(method="lm", size=0.5, color="grey")

ggplot(data=df, aes(x=runtime,y=audience_score, color=best_dir_win))+geom_boxplot()+theme_bw(base_family="Avenir")+facet_wrap(.~best_dir_win)+labs(x="Runtime in minutes", y="Audience score", title="")+guides(color=guide_legend(title="Director ever won an Oscar"))+theme(legend.position="bottom")

summary(df$runtime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    39.0    92.0   102.5   105.7   115.0   267.0
df_no<-df %>% filter(best_dir_win=="no")
df_yes<-df %>% filter(best_dir_win=="yes")
summary(df_no);summary(df_yes)
##  audience_score    director            runtime      best_dir_win
##  Min.   :11.00   Length:605         Min.   : 39.0   no :605     
##  1st Qu.:45.00   Class :character   1st Qu.: 92.0   yes:  0     
##  Median :65.00   Mode  :character   Median :101.0               
##  Mean   :61.82                      Mean   :104.4               
##  3rd Qu.:79.00                      3rd Qu.:114.0               
##  Max.   :96.00                      Max.   :267.0
##  audience_score    director            runtime    best_dir_win
##  Min.   :27.00   Length:43          Min.   : 84   no : 0      
##  1st Qu.:53.00   Class :character   1st Qu.:106   yes:43      
##  Median :73.00   Mode  :character   Median :122               
##  Mean   :69.51                      Mean   :124               
##  3rd Qu.:85.50                      3rd Qu.:138               
##  Max.   :97.00                      Max.   :202

Findings from EDA:

  1. The interquartile range of the movie runtime is 92 minutes to 115 minutes. The mean value of the runtime is 105 minutes as illustrated with the dotted line in the histogram. The audience score in the interquartile range of runtime varies quite a lot as shown in the point plot. We can already observe that there is a weak linearity between the audience score and runtime, but the study will still continue the linear modeling with the aim to learn from the results.

  2. In total there are 43 movies which are filmed by directors who have ever won Oscar(From now on, this group will be called as Yes-Oscar). The rest of 605 movies were filmed by those who have never won Oscar(From now on, this group will be called as No-Oscar). The range of audience score is wider in No-Oscar groups(11-96) compared to the Yes-Oscar (27-97). The mean and the median values of Yes-Oscar movies are higher than the No-Oscar movies. However, as the boxplot illustrates, there could exist some dependency between runtime and Oscar won.


Part 4: Modeling

fit<-lm(audience_score~runtime+best_dir_win, data=df) # multiple linear model
summary(fit)
## 
## Call:
## lm(formula = audience_score ~ runtime + best_dir_win, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.185 -16.135   3.004  17.380  34.924 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     43.58806    4.42561   9.849  < 2e-16 ***
## runtime          0.17455    0.04166   4.190 3.17e-05 ***
## best_dir_winyes  4.28297    3.24821   1.319    0.188    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.93 on 645 degrees of freedom
## Multiple R-squared:  0.03521,    Adjusted R-squared:  0.03222 
## F-statistic: 11.77 on 2 and 645 DF,  p-value: 9.527e-06
anova(fit)
## Analysis of Variance Table
## 
## Response: audience_score
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## runtime        1   8656  8656.4 21.8036 3.676e-06 ***
## best_dir_win   1    690   690.3  1.7386    0.1878    
## Residuals    645 256075   397.0                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model selection: Stepwise backward p-value elimination

For the model selection, the stepwise backward p-value elimination model was used. The process of the model starts with the full model. If the p-value is above the significant level, here we consider 0.05, then that variable will be dropped. Afterwards we refit the model as well as repeat the process. If the largest p-value is less than the significance level, we would not eliminate any predictors and the current model would be the best-fitting model. As the p-value of best_dir_win from the full model is greater than the significant level of 0.05, we drop this variable and refit the model.

fit2<-lm(audience_score~runtime, data=df) # refitting after dropping best_dir_win
summary(fit2)
## 
## Call:
## lm(formula = audience_score ~ runtime, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.638 -15.711   3.011  17.126  34.953 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.41771    4.33816   9.778  < 2e-16 ***
## runtime      0.18831    0.04035   4.667 3.72e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.94 on 646 degrees of freedom
## Multiple R-squared:  0.03261,    Adjusted R-squared:  0.03112 
## F-statistic: 21.78 on 1 and 646 DF,  p-value: 3.722e-06
anova(fit2)
## Analysis of Variance Table
## 
## Response: audience_score
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## runtime     1   8656  8656.4  21.779 3.722e-06 ***
## Residuals 646 256765   397.5                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is now less than chosen significance level of 0.05. Therefore, we can choose this as our final model and accordingly we can set the final equation of our model as follows: \[ {score_{audience}}=42.42+0.19\times{runtime} \]

Optional AIC strategy for model selection

Alternatively, we can use the built-in step function using AIC strategy from statsr package for the model selection.

step<-step(fit, direction="backward", trace=TRUE, steps=100) # Choosing a model by AIC in a stepwise algorithm
## Start:  AIC=3880.61
## audience_score ~ runtime + best_dir_win
## 
##                Df Sum of Sq    RSS    AIC
## - best_dir_win  1     690.3 256765 3880.4
## <none>                      256075 3880.6
## - runtime       1    6971.4 263046 3896.0
## 
## Step:  AIC=3880.35
## audience_score ~ runtime
## 
##           Df Sum of Sq    RSS    AIC
## <none>                 256765 3880.4
## - runtime  1    8656.4 265421 3899.8

It yields the same model selection.

Diagnostics for Multiple Linear Regression

Now we diagnose our final model according to the following conditions.

Condition 1. Linear relationship between variables The explanatory variable is linearly related to the response variable. For that we look for a random scatter around 0.

df_con<-as.data.frame(cbind(fit2$residuals,df$runtime))
colnames(df_con)<-c("residuals","runtime")

ggplot(df_con,aes(x=runtime,y=residuals))+geom_point(alpha=0.5)+theme_bw(base_family="Avenir")+geom_hline(yintercept=0, color="red",linetype="solid",size=0.2)

Diagnostic: As shortly seen in the exploratory data analysis above, the explanatory and response variable does not show a linear trend.

Condition 2. Nearly normal residuals with mean 0

ggplot(df_con,aes(residuals))+geom_histogram()

ggplot(df_con, aes(sample=residuals))+geom_qq(size=0.5)+geom_qq_line(size=0.5)

Diagnostic: There is a left skew.

Condition 3. Constant variability of residuals Residuals should be equally variable for low and high values of the predicted response variable. Residuals should be randomly scattered in a band with a constant width around 0.

# Checking using residual plots of residuals vs. predicted:
ggplot(df_con,aes(x=fit2$fitted.values, y=residuals))+geom_point(alpha=0.5)+geom_hline(yintercept=0, color="red",linetype="solid",size=0.2)

# Viewing absolute value of residuals vs. predicted to identify unusual observations:
ggplot(df_con,aes(x=fit2$fitted.values, y=abs(residuals)))+geom_point(alpha=0.5)+geom_hline(yintercept=0, color="red",linetype="solid",size=0.2)

Diagnostic: The variability of residuals is non-constant.

Condition 4. Independence of residuals

ggplot(df_con, aes(x=runtime, y=residuals))+geom_point()

Diagnostic: The observations are not independent.


Part 5: Prediction

We perform a prediction to test the final model.

newmovie<-data.frame(runtime=seq(30,300,15)) # Data frame of movies with runtimes between 30 min and 300 min with 15 minutes of interval. 
pred<-predict.lm(fit2,newmovie, interval="prediction", level=0.95) # Prediction of the model
ndf<-data.frame(runtime=seq(30,300,15),
    fit<-pred[,1],
    lwr<-pred[,2],
    upr<-pred[,3])
colnames(ndf)<-c("runtime","Expected score","lwr","upr"); ndf
##    runtime Expected score       lwr       upr
## 1       30       48.06699  8.431365  87.70260
## 2       45       50.89162 11.418483  90.36476
## 3       60       53.71626 14.370353  93.06217
## 4       75       56.54090 17.286634  95.79517
## 5       90       59.36554 20.167075  98.56400
## 6      105       62.19018 23.011524 101.36883
## 7      120       65.01482 25.819926 104.20971
## 8      135       67.83945 28.592326 107.08658
## 9      150       70.66409 31.328867 109.99932
## 10     165       73.48873 34.029790 112.94767
## 11     180       76.31337 36.695427 115.93131
## 12     195       79.13801 39.326203 118.94981
## 13     210       81.96265 41.922622 122.00267
## 14     225       84.78729 44.485270 125.08930
## 15     240       87.61192 47.014799 128.20905
## 16     255       90.43656 49.511926 131.36120
## 17     270       93.26120 51.977422 134.54498
## 18     285       96.08584 54.412106 137.75957
## 19     300       98.91048 56.816833 141.00412

As an example, the model predicts for a movie with 120 minutes of runtime to have an audience score of 65. With 95% confidence we can say that the movie is expected to have an audience score from 25.8 to 100 (the upper limit).


Part 6: Conclusion

Through the multiple linear regression model and the executed model selection it was found that whether or not the movie director has ever won an Oscar is not a significant variable for the audience score.

The fitted model for runtime and audience score of the movie yields an final equation of \[ {score_{audience}}=42.42+0.19\times{runtime} \]

This is illustrated in the figure below.

ggplot(df, aes(x=runtime, y=audience_score))+geom_point(alpha=0.05)+geom_smooth(method="lm")+theme_bw(base_family="Avenir")+labs(x="Runtime in minutes", y="Audience score", title="")+guides(color=guide_legend(title="Director ever won an Oscar"))+theme(legend.position="bottom")

As it could be inferred in the model diagnostics, the final model has very low R-squared (0.03), which demonstrates that merely 3% of the variability of response variable(audience score) as regards the explanatory variable(runtime) can be explained by the model. Primarily the weak linearity between the explanatory variable and the response variable resulted in the poor fit of the model.

Nonetheless, this study demonstrated the results of linear regression when the conditions for the least squares line is not met.